Fast-Champollion: A Fast and Robust Sentence Alignment Algorithm
Abstract
Sentence-level aligned parallel texts are important resources for a number of natural language processing (NLP) tasks and applications, such as statistical machine translation and cross-language information retrieval. With the rapid growth of online parallel texts, efficient and robust sentence alignment algorithms are becoming increasingly important. In this paper, we propose a fast and robust sentence alignment algorithm, Fast-Champollion, which combines length-based and lexicon-based algorithms. By optimizing the process of splitting the input bilingual texts into small fragments for alignment, Fast-Champollion, as our extensive experiments show, is 4.0 to 5.1 times as fast as current baseline methods such as Champollion (Ma, 2006) on short texts and about 39.4 times as fast on long texts, while remaining as robust as Champollion.
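To make the length-based component mentioned in the abstract more concrete, the following is a minimal sketch of a generic length-based aligner using dynamic programming. It is not the Fast-Champollion or Champollion implementation: the function names (`length_cost`, `align_by_length`), the set of allowed alignment shapes, and the penalty values are illustrative assumptions, and no lexicon or fragment splitting is involved.

```python
from typing import List, Tuple

# Allowed alignment shapes (source sentences, target sentences) and
# illustrative penalties. These values are assumptions, not from the paper.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch in [0, 1]; 0 means equal lengths."""
    longest = max(src_len, tgt_len)
    return abs(src_len - tgt_len) / longest if longest else 0.0


def align_by_length(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return aligned groups of (source indices, target indices)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward dynamic programming over prefixes of both texts.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Trace the cheapest path back from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


if __name__ == "__main__":
    source = ["aaaaaaaaaa", "bbbbbbbbbb", "short"]
    target = ["cccccccccccccccccccc", "brief"]
    print(align_by_length(source, target))
```

The table has one cell per pair of sentence positions, so the work grows with the product of the two text lengths. Splitting the input into small fragments before aligning, as the abstract describes, keeps each such table small.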
Similar resources
Champollion: A Robust Parallel Text Sentence Aligner
This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potentially noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese-English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champoll...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
A Robust Adaptive Observer-Based Time Varying Fault Estimation
This paper presents a new observer design methodology for time-varying actuator fault estimation. A new linear matrix inequality (LMI) design algorithm is developed to tackle the limitations (e.g. equality constraint and robustness problems) of the well-known, so-called fast adaptive fault estimation observer (FAFE). The FAFE is capable of estimating a wide range of time-varying actuator fault...
Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
Fast Transient Hybrid Neuro Fuzzy Controller for STATCOM During Unbalanced Voltage Sags
A static synchronous compensator (STATCOM) is generally used to regulate voltage and improve transient stability in transmission and distribution networks. This is achieved by controlling reactive power exchange between STATCOM and the grid. Unbalanced sags are the most common type of voltage sags in distribution networks. A static synchronous compensator (STATCOM) is generally used to maintain...
Publication date: 2010